ESSNet on Statistical Disclosure Control

Task 5. Improvement of software for microdata


5.a. Big surveys

5.a(1) Standardised anonymisation of microdata sets

This task relates to both 2.a. and 2.b. The idea is that the creation of anonymised microdata files for researchers should in future be integrated into the standardised process of producing statistics. This holds particularly for annual surveys, where such standardisation would bring considerable advantages to data producers and the research community.
We will test µ-ARGUS in order to establish whether it can be used as an instrument for the standardised anonymisation of big microdata sets, and to identify any problems with integrating the software into the production infrastructure. This question will be investigated using the example of the German microcensus, a 1% sample of the German population (about 1.2 million records).
Partners: DE
Deliverables: A list of proposals for improvements to the software.

5.a(2) Blocking methods

Some SDC methods for microdata protection become very time consuming when large surveys have to be protected. While some methods run in time linear in the data set size, others (like microaggregation or optimal recoding) take quadratic or, more generally, superlinear time. Disclosure risk assessment for microdata is also often superlinear (e.g. record linkage is quadratic).
Blocking is a popular approach to applying superlinear methods to large microdata sets. The idea is to split a large data set into smaller pieces (blocks) of manageable size that can be treated separately in reasonable time. Blocking should be done in such a way that its impact on data utility is as small as possible.
Usual blocking procedures involve selecting a number of variables in the data set, called blocking variables. Records are then sorted by the blocking variables, and the sorted data set is divided as many times as needed to obtain manageable subsets. Normally, a block is defined as the subset of records sharing a particular combination of values of the blocking variables. Building blocks from blocking variables has several drawbacks (a minimal sketch of the procedure follows the list below):
  • The choice of blocking variables may not be obvious;
  • Blocks obtained in this way may fail to adapt to the distribution of data (very heterogeneous blocks);
  • The size of some blocks may be too small or too big for some purposes.
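To make the standard procedure concrete, here is a minimal Python sketch of blocking on blocking variables. It is purely illustrative (the pandas-based implementation, function and variable names are our own, not µ-ARGUS code); running it also exhibits the third drawback, namely that block sizes follow the data and cannot be controlled:

    import pandas as pd

    def block_by_variables(data, blocking_vars):
        """Return one block per combination of values of the blocking variables."""
        return {key: block for key, block in data.groupby(blocking_vars)}

    # Example: block a small survey file on region and sex.
    survey = pd.DataFrame({
        "region": ["N", "N", "S", "S", "S"],
        "sex":    ["F", "M", "F", "F", "M"],
        "income": [21000, 35000, 28000, 19000, 42000],
    })
    for key, block in block_by_variables(survey, ["region", "sex"]).items():
        print(key, len(block))   # block sizes are uneven and not controllable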
Some blocking approaches based on clustering theory, mainly designed for use with record linkage, have been proposed in the literature (Cohen and Richman, 2002; McCallum et al., 2000). They aim at obtaining blocks that are closer to the natural clusters present in the data. We plan to use this type of blocking via clustering and, more specifically, via 2d trees. Solanas et al. (2006) showed that 2d trees can be used to obtain very homogeneous blocks and thus reduce the information loss caused by blocking.
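The sketch below illustrates the 2d-tree idea under simplifying assumptions of our own: records have already been reduced to two numerical coordinates, and the tree is built by median splits on alternating axes until every leaf holds at most a chosen number of records. It is not the Solanas et al. (2006) algorithm itself, only a minimal illustration of how such a tree yields blocks of controlled size that adapt to the data distribution:

    def two_d_tree_blocks(points, max_block_size, depth=0):
        """Split 2D points at the median of alternating axes until every
        leaf (block) holds at most max_block_size records."""
        if len(points) <= max_block_size:
            return [points]                      # leaf: one block
        axis = depth % 2                         # alternate x and y splits
        pts = sorted(points, key=lambda p: p[axis])
        mid = len(pts) // 2                      # median split
        return (two_d_tree_blocks(pts[:mid], max_block_size, depth + 1)
                + two_d_tree_blocks(pts[mid:], max_block_size, depth + 1))

    # Example: 1000 random points, blocks of at most 100 records each.
    import random
    points = [(random.random(), random.random()) for _ in range(1000)]
    blocks = two_d_tree_blocks(points, 100)
    print(len(blocks), sorted(len(b) for b in blocks))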
Partners: URV, CBS, DE. URV will write the report, CBS (NL) will implement the method in ARGUS and DE will do the testing.
Deliverables: A report in the first year and an implementation of a cluster-based blocking mechanism in the second year.
References:
W. Cohen and J. Richman (2002), Learning to match and cluster high-dimensional data sets for data integration, in Proceedings of ACM SIGKDD 2002.
A. McCallum, K. Nigam and L. Ungar (2000), Efficient clustering of high-dimensional data sets with application to reference matching, in Proceedings of ACM SIGKDD 2000, pp. 169-178.
A. Solanas, A. Martínez-Ballesté, J. M. Mateo-Sanz and J. Domingo-Ferrer (2006), A 2d-tree-based blocking method for microaggregating very large data sets, in Proceedings of ARES/DAWAM 2006, IEEE Computer Society, pp. 922-928.

5.b. Alternative risk models or microdata dissemination strategies

Microdata users are very diverse, and, as the literature has already made clear, an efficient data dissemination strategy should take full account of this heterogeneity. The distinction between Public Use Files and Microdata Files for Research rests on different scenarios, different risk models, possibly different masking procedures and different data utility / information loss requirements.
For social microdata, masking and risk/utility assessment for Public Use Files and Microdata Files for Research will be further investigated.
Survey-specific data dissemination strategies for enterprise microdata will also be investigated. A unified framework for risk assessment and data protection will be developed, and protection methods that take information loss into account will be analysed further.
Partners: IT, UK
Deliverables: At the end of each year, a report will be delivered to illustrate the progress made for social surveys, with examples on real surveys. At the end of each year, a further report will be produced on risk models and data protection for enterprise microdata.

5.c. Record linkage

Roughly speaking, record linkage consists of linking each record a in file A (the protected file) to a record b in file B (the original file). The pair (a, b) is a match if b turns out to be the original record corresponding to a. To apply this method to measuring the risk of identity disclosure, it is assumed that an intruder has an external dataset sharing some (key or outcome) variables with the released protected dataset and containing some identifier variables (e.g. passport number, full name, etc.). The intruder is assumed to try to link the protected dataset with the external dataset using the shared variables. The number of matches gives an estimate of the number of protected records whose respondents can be re-identified by the intruder. Accordingly, disclosure risk is defined as the proportion of matches among the total number of records in A.
There are two main types of record linkage: distance-based record linkage and probabilistic record linkage. Within the ESSNet, we propose to implement distance-based record linkage in µ-ARGUS.
Distance-based record linkage consists of linking each record a in file A to its nearest record b in file B. This method therefore requires defining a distance function that expresses nearness between records. This record-level distance can be constructed from distance functions defined at the variable level. Constructing record-level distances requires standardising variables to avoid scaling problems and assigning each variable a weight in the record-level distance, as the sketch below illustrates.
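The following Python sketch shows the whole chain under assumptions of our own (a weighted Euclidean record-level distance, standardisation by the original file's means and standard deviations, and files whose rows are aligned so that correct matches can be counted). It is an illustration of the technique, not the CASC or µ-ARGUS implementation:

    import numpy as np

    def link_and_risk(A, B, weights):
        """A: protected file, B: original file (row i of A was derived from
        row i of B); weights: one weight per variable.  Links each protected
        record to its nearest original record and returns the disclosure
        risk, i.e. the proportion of matches among the records of A."""
        mu, sigma = B.mean(axis=0), B.std(axis=0)
        As, Bs = (A - mu) / sigma, (B - mu) / sigma    # standardise variables
        matches = 0
        for i, a in enumerate(As):
            dist = np.sqrt((weights * (Bs - a) ** 2).sum(axis=1))
            if dist.argmin() == i:                     # linked to its source?
                matches += 1
        return matches / len(A)

    # Example: the "protection" is just added noise, so the risk is high.
    rng = np.random.default_rng(0)
    B = rng.normal(size=(500, 3))
    A = B + rng.normal(scale=0.1, size=B.shape)
    print(link_and_risk(A, B, weights=np.ones(3)))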
Partners: URV, NL, DE
Deliverable: Software for distance-based record linkage produced within the CASC project (deliverable 1.2 D6) will be incorporated into µ-ARGUS during the first year. The appropriate documentation will be added to the µ-ARGUS manual.